Word Segmentation in the Spoken Dutch Corpus
Authors
ELIS, University of Ghent, Sint-Pietersnieuwstraat 41, B-9000 Ghent, Belgium, {martens,odul,rvparijs}@elis.rug.ac.be
Dept. Language & Speech, University of Nijmegen, P.O. Box 9103, 6500 HD Nijmegen, The Netherlands, [email protected]
ESAT, K.U.Leuven, Kasteelpark Arenberg 10, B-3001 Heverlee, Belgium, {kris.demuynck,tom.laureys,jacques.duchateau}@esat.kuleuven.ac.be
Abstract
This paper describes the aims of the word segmentation in the Spoken Dutch Corpus (Corpus Gesproken Nederlands, CGN) and the procedures used to create it. For one million words, a manually verified segmentation will be created, whereas the remaining nine million words will come only with an automatically generated segmentation. We describe our efforts to create the best possible automatic word segmentation from an auditorily verified phonetic transcription, and the development of a protocol for the manual verification of that automatic segmentation. The paper also reports some figures on the manual verification of the first hundred thousand words.
Similar resources
Assessing Segmentations: Two Methods for Confidence Scoring Automatic HMM-Based Word Segmentations
The Dutch-Flemish project Spoken Dutch Corpus (1998-2003) aims at the development of an annotated corpus of 10 million spoken words. In order to make the speech data easily accessible, a word segmentation couples the orthographic transcription to the speech signal by means of time stamps. Generally, such segmentations are produced manually. Since this manual procedure is a time-consuming effort...
Building a corpus of spoken Dutch
In this paper the Spoken Dutch Corpus Project is presented, a joint Flemish-Dutch undertaking aimed at the compilation and annotation of a 10-million-word corpus of spoken Dutch. Upon completion, the corpus will constitute a valuable resource for research in the fields of computational linguistics and language and speech technology. The paper first gives an overview of the project. It then goes ...
Automatic Phonemic Labeling and Segmentation of Spoken Dutch
The CGN corpus (Oostdijk, 2000) (Corpus Gesproken Nederlands/Corpus Spoken Dutch) is a large speech corpus of contemporary Dutch as spoken in Belgium (3.3 million words) and in the Netherlands (5.6 million words). Due to its size, manual phonemic annotation was limited to 10% of the data and automatic systems were used to complement this data. This paper describes the automatic generation of th...
Harvesting Dutch Trees: Syntactic Properties of Spoken Dutch
In this paper, we report on quantitative research into certain word order phenomena in Dutch. In our research, we use the Spoken Dutch Corpus (CGN), a major new resource for research into contemporary spoken Dutch. After briefly introducing the primary data, the annotations added, and some of the tools to explore the primary data and the annotations, we illustrate how the Corpus may be utilized...
The Spoken Dutch Corpus. Overview and First Evaluation
In this paper the Spoken Dutch Corpus project is presented, a joint Flemish-Dutch undertaking aimed at the compilation and annotation of a 10-million-word corpus of spoken Dutch. Upon completion, the corpus will constitute a valuable resource for research in the fields of computational linguistics and language and speech technology. The paper first gives an overall description of the project, i...
Publication date: 2002